Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze how the propagation speed
of idle waves emanating from injected delays depends on the execution and
communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.
Comment: 10 pages, 9 figures; title change
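The idea of an idle wave spreading from a single injected delay can be illustrated with a toy model. The sketch below is not the simulator or benchmark used in the paper; it assumes a hypothetical open chain of ranks, a fixed execution time per iteration, zero communication time, and a rule that each rank may start an iteration only after it and its direct neighbors have finished the previous one. The function name `simulate_idle_wave` and all of its parameters are made up for illustration.

```python
def simulate_idle_wave(n_ranks, n_iters, t_exec, delay_rank, delay):
    """Toy model: finish times per iteration for a chain of ranks.

    Assumptions (not taken from the paper): open chain, nearest-neighbor
    dependency, zero communication cost, one delay injected in iteration 0.
    """
    finish = [0.0] * n_ranks  # finish time of the previous iteration
    history = []
    for k in range(n_iters):
        new_finish = []
        for i in range(n_ranks):
            # A rank waits for itself and its direct neighbors.
            neighbors = [j for j in (i - 1, i, i + 1) if 0 <= j < n_ranks]
            start = max(finish[j] for j in neighbors)
            extra = delay if (k == 0 and i == delay_rank) else 0.0
            new_finish.append(start + t_exec + extra)
        finish = new_finish
        history.append(finish)
    return history


if __name__ == "__main__":
    for k, row in enumerate(simulate_idle_wave(8, 4, 1.0, 0, 5.0)):
        print(k, row)
```

In this model the delay reaches one additional rank per iteration, so the front of the idle wave moves at a constant speed set by the dependency pattern. Noise, finite communication time, and other communication patterns, which the abstract identifies as the interesting factors, are deliberately left out.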
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory
New algorithms and optimization techniques are needed to balance the
accelerating trend towards bandwidth-starved multicore chips. It is well known
that the performance of stencil codes can be improved by temporal blocking,
lessening the pressure on the memory interface. We introduce a new pipelined
approach that makes explicit use of shared caches in multicore environments and
minimizes synchronization and boundary overhead. For clusters of shared-memory
nodes we demonstrate how temporal blocking can be employed successfully in a
hybrid shared/distributed-memory environment.
Comment: 9 pages, 6 figures
LIKWID: Lightweight Performance Tools
Exploiting the performance of today's microprocessors requires intimate
knowledge of the microarchitecture as well as an awareness of the ever-growing
complexity in thread and cache topology. LIKWID is a set of command-line
utilities that addresses four key problems: probing the thread and cache
topology of a shared-memory node, enforcing thread-core affinity on a program,
measuring performance counter metrics, and microbenchmarking for reliable upper
performance bounds. Moreover, it includes an mpirun wrapper allowing for
portable thread-core affinity in MPI and hybrid MPI/threaded applications. To
demonstrate the capabilities of the tool set we show the influence of thread
affinity on performance using the well-known OpenMP STREAM triad benchmark, use
hardware counter tools to study the performance of a stencil code, and finally
show how to detect bandwidth problems on ccNUMA-based compute nodes.
Comment: 12 pages
The Kernel Polynomial Method
Efficient and stable algorithms for the calculation of spectral quantities
and correlation functions are some of the key tools in computational condensed
matter physics. In this article we review basic properties and recent
developments of Chebyshev expansion based algorithms and the Kernel Polynomial
Method. Characterized by a resource consumption that scales linearly with the
problem dimension, these methods have enjoyed growing popularity over the last
decade and have found broad application not only in physics. We discuss in
detail representative examples from the fields of disordered systems, strongly
correlated electrons, electron-phonon interaction, and quantum spin systems. In
addition, we illustrate how the Kernel Polynomial Method is successfully
embedded into other numerical techniques, such as Cluster Perturbation Theory
or Monte Carlo simulation.
Comment: 32 pages, 17 figs; revised version
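The linear scaling mentioned in the abstract comes from the Chebyshev recursion T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), which needs only one matrix-vector product per expansion order. The following is a minimal sketch of computing the moments <v|T_n(H)|v> this way, not the implementation from the article. It assumes that the spectrum of H has already been rescaled into [-1, 1]; the names `chebyshev_moments` and `matvec` are hypothetical.

```python
def chebyshev_moments(matvec, v, n_moments):
    """Return [<v|T_n(H)|v> for n in range(n_moments)].

    matvec(x) must return H applied to the list x. H is assumed to be
    rescaled so that its eigenvalues lie in [-1, 1].
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    t_prev = list(v)          # T_0(H) v = v
    t_curr = matvec(v)        # T_1(H) v = H v
    moments = [dot(v, t_prev), dot(v, t_curr)]
    for _ in range(2, n_moments):
        h_t = matvec(t_curr)
        # Chebyshev recursion: T_{n+1} = 2 H T_n - T_{n-1}
        t_next = [2.0 * a - b for a, b in zip(h_t, t_prev)]
        moments.append(dot(v, t_next))
        t_prev, t_curr = t_curr, t_next
    return moments[:n_moments]


if __name__ == "__main__":
    # Diagonal test matrix diag(0.5, -0.5), already within [-1, 1].
    diag = [0.5, -0.5]
    apply_h = lambda x: [d * xi for d, xi in zip(diag, x)]
    print(chebyshev_moments(apply_h, [1.0, 0.0], 5))
```

If `matvec` is implemented for a sparse matrix, each moment costs time proportional to the number of nonzero entries. Turning the moments into a smooth spectral function additionally requires damping them with a kernel, which this sketch omits.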
Validation of hardware events for successful performance pattern identification in High Performance Computing
Hardware performance monitoring (HPM) is a crucial ingredient of performance
analysis tools. While there are interfaces like LIKWID, PAPI or the kernel
interface perf_event, which provide HPM access with some additional features,
many higher level tools combine event counts with results retrieved from other
sources like function call traces to derive (semi-)automatic performance
advice. However, although HPM has been available for x86 systems since the early 1990s,
only a small subset of the HPM features is used in practice. Performance
patterns provide a more comprehensive approach, enabling the identification of
various performance-limiting effects. Patterns address issues like bandwidth
saturation, load imbalance, non-local data access in ccNUMA systems, or false
sharing of cache lines. This work defines HPM event sets that are best suited
to identify a selection of performance patterns on the Intel Haswell processor.
We validate the chosen event sets for accuracy in order to arrive at a reliable
pattern detection mechanism and point out shortcomings that cannot be easily
circumvented due to bugs or limitations in the hardware.